GROUP BY clause
Back to DuckDB Data Engineering Glossary
Overview
The GROUP BY
clause is a fundamental SQL operation that allows you to organize rows into groups based on one or more columns, enabling aggregate calculations on each group rather than the entire dataset. When you use GROUP BY
, each group will contain all rows that share the same values in the specified grouping columns.
Basic Usage
The most common use of GROUP BY
involves aggregating data by a single column. For example, to count orders by country:
Copy code
SELECT country, COUNT(*) as order_count
FROM orders
GROUP BY country;
Multiple Column Groups
You can group by multiple columns to create more specific groupings. The order of columns in the GROUP BY
doesn't affect the results:
Copy code
SELECT country, product_category, SUM(revenue) as total_revenue
FROM sales
GROUP BY country, product_category;
DuckDB-Specific Features
DuckDB extends the standard GROUP BY
functionality with helpful features like GROUP BY ALL
, which automatically groups by all non-aggregated columns in the SELECT
clause:
Copy code
-- These queries are equivalent in DuckDB
SELECT country, region, SUM(sales)
FROM orders
GROUP BY country, region;
SELECT country, region, SUM(sales)
FROM orders
GROUP BY ALL;
Common Pitfalls
When using GROUP BY
, all columns in the SELECT
clause must either be included in the GROUP BY
clause or be wrapped in an aggregate function (like SUM
, COUNT
, AVG
). DuckDB will return an error if this rule is violated, helping prevent accidental misuse of grouping operations.
Related Concepts
The GROUP BY
clause is often used with:
HAVING
clause to filter grouped results- Window functions for more complex aggregations
ORDER BY
to sort the grouped results- Aggregate functions like
COUNT
,SUM
,AVG
,MAX
, andMIN
For more advanced grouping operations in DuckDB, look into GROUPING SETS
, ROLLUP
, and CUBE
clauses which provide additional ways to aggregate data at multiple levels simultaneously.